This assignment is for ETC5521 Assignment 1 by Team WOMBAT comprising of Hai Hanh Ngo and Dewi Lestari Amaliah.
Obtaining data from hotel industry has always been a real challenge. While we can read about the hotel jargons and terms easily, scanned through some of the common descriptive statistics of the hotel industry in market research reports or seems to stand on top of the world by pocketing some of the “industry secret” provided by the Internet for price negotiating for your next summer trip, less is known about how a hotel actually runs behind the scence. This sad truth was foreseable as no hotels would be willing to share their performance to the public eyes. Such limitation posed a challenge for researchers, scientists, hotelliers and many others to perform any studies relating to this industry.
The “hotel booking demand datasets” compiled by Nuno Antonio, Ana de Almeida and Luis Nunes was a beautiful effort to overcome such challenge. This dataset was obtained from two hotels in Portugal - one city hotel in Lisbon and one resort in Algrave. Some of the sensitive information that could reveal the identity of the two hotels was not provided, but it did not affect the important role of this dataset for the purpose of education, management, machine learning and many others.
In this study, we performed some initial analysis on the dataset to help us answer one of the question we mentioned at the beginning, about how hotels performance actually looks like behind the scence. What can we learn about the way hotels manage their customers’info, how they priced our bookings or who are more likely to cancel their bookings.
We will mainly use R programming software for this analysis along with other research materials we could find on the internet. Graphical displays are mostly utilised to support our diagnostics and analysis.
In 2018, Nuno Antonio, Ana de Almeida and Luis Nunes published a data article called “Hotel booking demand datasets” which introduced to the public two datasets for two hotels in Portugal, one resort in Algarve (H1) and one city hotel in Lisbon (H2). These datasets were originally collected to serve the development of prediction models to assess the hotels’ likelihood of having their bookings cancelled, however the uses of variables included also allow other researching purposes.
The datasets for the two hotels can be downloaded separately at https://www.sciencedirect.com/science/article/pii/S2352340918315191. However, jthomasmock at tidytuesday challenge had done us a favor and combined the two datasets into one. The two original datasets and the combined one can be obtained at https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md. For this study, we used only the combined dataset.
The dataset is stored in a csv file format with 32 variables and 119,390 observations (40,060 for H1 and 79,330 for H2). Each observation represents one hotel booking. The format of the dataset is as below:
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resor…
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75,…
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July"…
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4,…
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "B…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GB…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corpora…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpora…
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C"…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C"…
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposi…
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", …
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL"…
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_type <chr> "Transient", "Transient", "Transient",…
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00…
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3,…
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out",…
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 2…
Knowing this dataset belonged to a hotel, we can make sense most of the variables. However, not all of them are familiar for anyone who did not have a background of hotel management. We will run through some of the industry jargons and variables’meaning before we take a further look at the data.
Variables note:
You may notice that some variables contained the “NULL” values (eg: agent or company variable). This “NULL” value did not mean the value was missing, rather such value did not exist to begin with; for example a booking may not have the ID of an agent or a company associated with it as such booking was made by an individual traveller.
We started with 31 variables to look at, and really without any distinct direction, so we started off by looking at some of the most prominent features of hotels.
Tourism industry cares a lot about seasonality as it would affect the guest flows to tourist locations. Hotel season divided into two main seasons: the high and low season. As the name implies, high season is a busy season when the weather is favourable and the guests inflows are high; low season vice versa. Portugal is of no exception. The high season in Portugal usually runs in summer (June to September) and spring time (January to March) ; the beaches are usually the busiest in July and August. The low season usually falls in the winter season which starts around November and ends by the end of February. Weather during this period may feature humidity, unexpected rain and strong, cold breeze which is not too ideal for sightseeing.
Keeping the season time in mind, we are interested in finding out any possible influence seasonality may have on the guest count and adr of our hotels in study. The analysis in this section took into account only “Check-out” or successful bookings to avoid inflating numbers.
Guest Count
First, we tried to plot the guest count by month for each year to check if the seasonality is consistent through years (it should be unless some one - off events took place and disturbed the market). Problems arised when we wanted to look at the yearly subset of our records and as our dataset ran from 1 July 2015 to 31 August 2017, we could not group the data by their conventional year of 2015, 2016 and 2017. We decided to break down the dataset into three groups by allocating 12 months into each and called them year 1,2 and 3. As Year 3 only had 2 months, we dropped that year and focused only on two Year 1 (July 2015 - June 2016) and Year 2 (July 2016 to June 2017). The result is presented in figure 5.1 below.
Figure 5.1: Number of guests by month
The months are presented in the order of occurence (July was the first year for each year) to ensure the choronological order of the dataset. The overall trend was roughly the same for both years and both hotels. The seasonality trend resembled the “W” shape with the lower points of W fell into the winter months from November to January and the peak points were in spring and summer time. Interestingly, both hotels saw the higest of guests in Spring time in Year 1 (May and March) while in year 2, the highest was observed in the summer time (August and October).
There were some anormalities, however, spotted, of which the most noticable was that City hotel suffered a serious hit in guest counts in July 2015 while Resort saw no such event. This drop could be due to some hotel - specific reasons, for example the hotel might have closed down for some time for renovation; however we were unable to answer due to the lack of information of hotel’s identity. Other abnormal notice could be a reverse trend in March of Resort. Again this abnormality could not be explained and we believed it was due to hotel - specific cause rather than industry - specific.
Next we look at the Average Daily Rate or ADR in short.
Figure 5.2: ADR by month
Unlike figure 5.1, figure 5.2 showed a striking difference between the two hotels. Resort had the highest rates in the summer time, which make sense as Algrave is a beach area. Both hotels recorded the lowest rates in winter, however City hotel observed less fluctuation than resort.
We noticed that in the dataset, some of the bookings recorded a very low ADR (less than 10 euros), some may even down to 1 Euro for non - complimentary rooms in both hotels. This may seem a little strange and could created an impact on the ADR we plotted above in figure 5.2. However, we chose not to exclude those adr as we did not know the pricing policy of the hotels in question.
Another thing we can look at for the sake of comparison is the reservation status of these two hotels at figure 5.3 below:
Figure 5.3: Reservation status by hotel type
Cancellation seems to be a problem in city hotel. Visually, the number of cancelation in city hotel was way higher than resort. However, since the total number of bookings in the two hotels were very different to begin with, we might want to check the proportion rather than the absolute number.
| hotel | Canceled | Check-Out | No-Show | total_bookings | prop_cancel |
|---|---|---|---|---|---|
| City Hotel | 32186 | 46228 | 916 | 79330 | 0.4057 |
| Resort Hotel | 10831 | 28938 | 291 | 40060 | 0.2704 |
In table 5.1, the overall cancelation rate of 3 years posed a striking difference between city hotel and resort: 40.57 percent vs 27.04 percent. What factor could possibly contribute to such difference? Did the problem come from the hotel itself or from whatever happened in the overall tourism industry at that point in time? To answer that question, we proceed to find out the common points in both hotel’s cancellation pool.
How do the cancellations have in common?
The Hotel’s PMS recored the lead time, which is the number of days elapse between the entering date of the booking into the system and the arrival date, the reservation status date and the days in waiting list, which is the number of days before the booking was confirmed with the customers. A high days in waiting list indicating that a booking takes longer to proceed and a high lead time indicating that the booking was made earlier prior to the arrival date.
Figure 5.4: System days of cancellation bookings
Both hotels seems to proceed customers’ bookings with great speed with the median waiting days at around 0 days. However, city hotel had a lot of cancelled bookings that sat through a significantly high number of days in the system before they were confirmed to the customers. On other note, city hotel had a high range of lead time and days in the system, signalling that bookings which were made too early prior arrival time may have higher chance of being cancelled.
We will look futher at this assumption, but this time comparing check-out with cancelled bookings to verify the impact of lead time and days in waiting list on the cancellation rate.
Check-out vs Cancelled
We want to compare the days in waitng list and days in system for both hotels. Will a booking that takes too long to proceed (high days in waiting list) or is made too early before the arrival date (high lead time) more prone to cancellation?
Figure 5.5: System days of Check-out and Canceled
At the first glance at figure 5.5, both types of bookings still maintained a median days of waiting as 0 days, cancelled bookings had a lower days in system compared to check-out, which was no surprises as cancellations usually are done days before the arrival date. However, lead time for cancelled bookings were much higher than of Check-out, except for the two outliners of check-out which exceed 650 days. This confirmed our assumption of lead time having an impact on the cancellation chance. The earlier the booking, the higher the chance of cancellation.
The one who got away
Having known which kind of the bookings are more likely to get cancelled, we are interested in knowing who were the one behind those bookings. We started by looking at the market segment of these cancellation:
Figure 5.6: Who are more likely to cancel?
We calculate the cancellation rate by taking the number of cancellation bookings divided by total bookings for each market segment. From figure 5.6, we see that Groups had an impressive rate of cancellation at 68.44% for City hotel and 42.14% for Resort, followed by Travel agency/ tour operator (both offline and Online). City hotel noticably had the top 3 cancellation percentage at a much higher level than Resort. Aviation or Air crew only showed up in City hotel, which actually made sense as airlines only need a place for their air crew to stay over and it does not have to be at a fancy resort.
Now that we know Groups were mostly the ones who defaulted the most, we hope to know the why they did what they did, i.e cancelled their bookings. Was this because of the deposit policy or their requests not being fullfilled? We moved on to look at deposit and other guest’sspecial requests in search for the reasons.
Deposit policy
We gathered all the cancelled bookings and divided them by their deposit type.
Figure 5.7: Deposit type
Figure 5.7 showed us something really interesting. If we looked at the absolute numbers of cancellation by counting the number of canceled bookings for each deposit type (left plot), we will have no deposit as the highest with non-refund followed by roughly a half. This came as no surprise as nowadays people were not expected to deposit for their stays, which made them more likely to cancel their bookings without spending a penny.
However, if we looked at the cancel rate by deposit type which we took the canceled bookings divided by all the bookings under each deposit type (right plot), we had a quite counter-intuitive discover. The non_refund now had the highest rate of cancellation, leaving the other two types far behind. This implied that people actually canceled the bookings that they had paid in advance! This insight called for further investigation, ideally with a different dataset of different hotels to verify its validity.
Other special requests
Finally, we wanted to know dig a little bit further into the bizzarre finding we made earlier. What made people threw away good money by cancelling the bookings they had paid before actually arrived at the premises? Could it be because per each deposit type, the hotels could not answer their requests (lacks parking spaces, unable to respond to special requests)? We will try to look at these elements for each deposit type to find out.
We will compute the percentage of repeated guests for all bookings, average number of previous cancellations, successful bookings, car parking space requests and special requests.
| deposit_type | bookings_count | percent_repeated_guest | previous_cancellations | previous_successful_booking | car_parking_space | special_requests |
|---|---|---|---|---|---|---|
| No Deposit | 104641 | 0.0357604 | 0.0420390 | 0.1549106 | 0.0711289 | 0.6514273 |
| Non Refund | 14587 | 0.0043875 | 0.4114623 | 0.0106259 | 0.0000686 | 0.0017824 |
| Refundable | 162 | 0.0246914 | 0.0000000 | 0.0185185 | 0.1234568 | 0.1419753 |
Table ?? added more to the confusion when non-refund bookings had customers who were either quite new to the hotels (low % of repeated guests) or had made a lot of cancellations before (highest previous cancellations, lowest successful booking), rarely required parking space (lowest average requests for car parking space) and rarely posed special requests (lowest average special requests) yet still willing to pay in advance only to cancel later. To put in plain words, these customers were either too pissed off in the past (high previous cancellations), too new to the hotels or not too picky but still decided to pay first and lost all the advance payment by abadoning the bookings. This point clearly asked for further investigation to decide whether this is hotel-specific or industry-specific.
In this study, we tried to explore the dataset in many angels. We started by comparing the only two hotels featured in this study across all of the variables provided in the dataset and we found out that